Lovisa Franzén, lovisa.franzen@scilifelab.se
October, 2019
Budesonide, sold by AstraZeneca under the brand names Pulmicort and Symbicort, is a glucocorticoid used for long-term management of asthma and chronic obstructive pulmonary disease (COPD). The compound can be administered through inhalation or as a pill, among other forms. It was first sold as a medication in 1981 and has since held a large share of the market.
In a publication by Barrette et al. (2016) [1], the authors studied the effect of budesonide treatment in primary human foetal lung explants. They prepared RNA from 3 samples of foetal lung at 23 weeks gestation before (preculture, PC) and after 4 days culture as explants with (Bud) or without (Way) budesonide (30 nM) and performed RNA-seq on the 9 samples. The sequencing was performed on an Illumina HiSeq 2500, and the processed data is publicly available at Gene Expression Omnibus (accession number GSE83888).
[1] Anne Marie Barrette, Jessica K. Roberts, Cheryl Chapin, Edmund A. Egan, Mark R. Segal, Juan A. Oses-Prieto, Shreya Chand, Alma L. Burlingame, and Philip L. Ballard, "Antiinflammatory Effects of Budesonide in Human Fetal Lung", *American Journal of Respiratory Cell and Molecular Biology*, 2016, link to article
First I set everything up by importing the necessary modules, defining paths, and defining basic functions.
import os
import re
import json
import numpy as np
from numpy import random
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import networkx as nx
print(nx.__version__)
dir_project = os.path.realpath('..')
dir_data = os.path.join(dir_project, 'data')
dir_res = os.path.join(dir_project, 'results')
def grep(l, p):
    return [i for i in l if p in i]
The data files are located in the data/ folder. To limit the scope of this project, I've decided to exclusively focus on the 'Way' (without treatment) and 'Bud' (with budesonide treatment) groups.
data_files = os.listdir(dir_data)
data_files
exp_way_1 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Way')[0]) , sep = '\t')
exp_way_2 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Way')[1]) , sep = '\t')
exp_way_3 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Way')[2]) , sep = '\t')
exp_bud_1 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Bud')[0]) , sep = '\t')
exp_bud_2 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Bud')[1]) , sep = '\t')
exp_bud_3 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Bud')[2]) , sep = '\t')
exp_way = pd.merge(exp_way_1, exp_way_2, on='Gene').merge(exp_way_3, on='Gene')
exp_bud = pd.merge(exp_bud_1, exp_bud_2, on='Gene').merge(exp_bud_3, on='Gene')
exp_all = pd.merge(exp_way, exp_bud, on='Gene')
exp_all = exp_all.rename(columns={'Gene': 'id',
'Way_1.cpm': 'way_1',
'Way_2.cpm': 'way_2',
'Way_3.cpm': 'way_3',
'Bud_1.cpm': 'bud_1',
'Bud_2.cpm': 'bud_2',
'Bud_3.cpm': 'bud_3',
})
print(exp_all.shape)
exp_all[:10]
Here we can see that the dataset contains a total of 83909 genes, and data collected from the six samples where three samples are without treatment ("way") and three are with budesonide treatment ("bud").
exp_all.describe()
As one can see here, the counts have been normalised through CPM (counts-per-million) conversion, and each sample has the same average gene count. I am, however, unsure whether these counts are log2-transformed. If they are, the differential gene expression analysis could be affected, so ideally the counts should not be logged beforehand.
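As a minimal sketch of what CPM normalisation does (on a small hypothetical count matrix, not the actual data), and of a quick heuristic for checking whether values are log-transformed:

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts (genes as rows, samples as columns)
counts = pd.DataFrame({"s1": [10, 90, 0], "s2": [5, 45, 50]})

# CPM: scale each sample so that its counts sum to one million
cpm = counts / counts.sum(axis=0) * 1e6

# Heuristic: in a non-logged CPM matrix every column sums to ~1e6;
# log2-transformed values would not have this property
print(cpm.sum(axis=0))
```

If the column sums of the downloaded data are far from one million, that would suggest the values were transformed after the CPM conversion.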
plot_data = exp_all.iloc[:,1:].melt()
g = sns.boxplot(x = "variable", y = "value", data = plot_data)
g.set_yscale("log")
plt.show();
# Visualize with a pair-plot to study correlations between groups
# Since this is a quite intensive plot, I'll only plot a subset of the data
n_subset = 2000
idx = np.random.choice(len(exp_all), n_subset, replace=False)
sns.pairplot(exp_all.iloc[idx,1:], diag_kind="kde", kind="reg");
The data is generally heavily skewed towards lower counts (positively skewed), as is expected for most RNA-seq data. Moreover, clearer linear correlations in expression can be seen within each group than between groups.
I couldn't find any Python modules that perform differential expression analysis (DEA). One option I did find was to call the popular R package DESeq2 through the rpy2 module, but I had trouble getting that to work since it relies on the local R installation.
Instead I built upon the code we worked with during the course. I reused Lukas' code for computing q-values and wrapped it inside a function that I call dea(). This function performs Student's t-test between the two groups for each gene, and also computes the log2 fold-change (log2(mean(post_group)/mean(pre_group))) for each gene. This is a very simplified way of doing DEA, but it seems to work reasonably well.
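As a minimal sketch of the per-gene computation inside dea() (toy CPM values, not taken from the dataset):

```python
import numpy as np
from scipy import stats

# Toy CPM values for one gene across three replicates per group
way = np.array([10.0, 12.0, 11.0])  # untreated (reference group)
bud = np.array([22.0, 20.0, 24.0])  # budesonide-treated

# Equal-variance two-sample t-test, treated vs untreated
t_stat, p_val = stats.ttest_ind(bud, way)

# log2 fold-change: mean of treated over mean of reference
log2fc = np.log2(bud.mean() / way.mean())  # log2(22 / 11) = 1.0
```

A positive log2fc thus means the gene is higher in the treated group; a doubling of the mean corresponds to a log2fc of exactly 1.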
def dea(expr_df, n_replicates = 3, id_col = "id"):
    """
    :param expr_df: Pandas dataframe containing genes as rows
        and unique samples as columns. Currently limited to comparisons
        between two groups, where the columns belonging to samples
        within each group are next to one another. Currently, the fold-change is
        computed as the mean of group 2 divided by the mean of group 1, i.e. the
        group in the first columns serves as the reference group.
    :param n_replicates: Number of samples or replicates within each group
    :param id_col: Column name for the gene identifier
    """
    gene_ids = list(expr_df[id_col])
    expr_data = expr_df.drop(columns=[id_col])
    nrows_df = expr_df.shape[0]
    output_m = np.zeros((nrows_df, 3))
    for i in range(nrows_df):
        a = expr_data.iloc[i, :n_replicates]
        b = expr_data.iloc[i, n_replicates:]
        output_m[i, 0:2] = stats.ttest_ind(b, a)
        if np.isnan(output_m[i, 1]):
            output_m[i, 1] = 1
        a_avg = np.average(a)  # without treatment
        b_avg = np.average(b)  # with treatment
        # set 0-counts to non-zero to enable division and logging
        if a_avg == 0:
            a_avg = 0.0001
        if b_avg == 0:
            b_avg = 0.0001
        output_m[i, 2] = np.log2(b_avg / a_avg)
    output_a = pd.DataFrame(np.concatenate((output_m, np.array([gene_ids]).T), axis=1))
    output_a.rename(columns={0: 't_statistic', 1: 'p_value', 2: 'log2fc', 3: 'id'}, inplace=True)
    output_a[['t_statistic', 'p_value']] = output_a[['t_statistic', 'p_value']].astype(float)
    qvals = qvalues(np.array(output_a[['p_value', 'id']]))
    output_b = pd.DataFrame(qvals)
    output_b.rename(columns={0: 'q_value', 1: 'p_value', 2: 'id'}, inplace=True)
    output_df = pd.merge(expr_df, output_a, on='id')
    output_df = pd.merge(output_df, output_b.drop(columns=['p_value']), on='id')
    for col in ['t_statistic', 'p_value', 'log2fc']:
        output_df[col] = pd.to_numeric(output_df[col])
    output_df = output_df.sort_values(by=['p_value', 'q_value', 'log2fc'])
    return output_df
def bootstrap(invec):
    # Lukas' function
    idx = np.random.randint(0, len(invec), len(invec))
    return [invec[i] for i in idx]
def estimatePi0(p, numBoot=100, numLambda=100, maxLambda=0.95):
    # Lukas' function
    p = np.array(p, dtype=float)  # ensure array so the elementwise comparisons below work
    p.sort()
    n = len(p)
    lambdas = np.linspace(maxLambda / numLambda, maxLambda, numLambda)
    Wls = np.array([n - np.argmax(p >= l) for l in lambdas])
    pi0s = np.array([Wls[i] / (n * (1 - lambdas[i])) for i in range(numLambda)])
    minPi0 = np.min(pi0s)
    mse = np.zeros(numLambda)
    for boot in range(numBoot):
        pBoot = np.sort(np.array(bootstrap(p)))
        WlsBoot = np.array([n - np.argmax(pBoot >= l) for l in lambdas])
        pi0sBoot = np.array([WlsBoot[i] / (n * (1 - lambdas[i])) for i in range(numLambda)])
        mse = mse + np.square(pi0sBoot - minPi0)
    minIx = np.argmin(mse)
    return pi0s[minIx]
def qvalues(pvalues):
    """
    Modification of Lukas' qvalues function.
    Input: pvalues – should be a numpy array
    containing p-values in the first column
    and sample ids in the second column.
    """
    m = len(pvalues)
    pvalues = pvalues[pvalues[:, 0].argsort()]  # sort by p-values
    pi0 = estimatePi0([p for p, gene in pvalues])
    num_p, qs = 0.0, []
    for p, gene in pvalues:
        num_p += 1.0
        q = pi0 * p * m / num_p
        qs.append((q, p, gene))
    # enforce monotonicity: running minimum from the largest p-value downwards
    qs.reverse()
    old_q = 1.0
    for ix in range(len(qs)):
        q = min(old_q, qs[ix][0])
        old_q = q
        qs[ix] = (q, qs[ix][1], qs[ix][2])
    qs.reverse()
    return qs
First I'll pre-filter the data set before proceeding. In the article, the data owners performed a "prefiltering to exclude genes possessing ≤5 counts per million when summed over all experiments". However, I'd like to be even stricter in my filtering, so that my gene set consists of a shorter list of genes displaying greater differences. Also, since my DEA does not include any shrinkage (unlike e.g. DESeq2), I think it is better to ensure that the genes I analyse have larger expression values.
I'll try setting the threshold to 20 instead of 5.
sum_cutoff = 20
exp_subset = exp_all.loc[(exp_all.iloc[:,1:].sum(axis=1) > sum_cutoff)]
print(exp_subset.shape)
exp_subset.sort_values(by="way_1")[:10]
Applying this cutoff gave me 12918 genes to work with.
Run the "differential gene expression analysis"
exp_dea = dea(exp_subset)
# Check the head of the results table from the DEA
exp_dea[:10]
print(exp_dea['q_value'].describe(), "\n")
print(exp_dea['p_value'].describe() )
Taking a quick look at the obtained p- and q-values, we can see that even the lowest q-value is quite high (>0.05), which is probably due to the very large number of tests being performed.
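To get a feel for why adjusted values end up high with this many tests, here is a small simulation under the global null. Note this sketch uses the plain p*m/rank (Benjamini-Hochberg-style) adjustment with a running minimum, not the pi0-based estimator used above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 12918  # same number of tests as in the filtered dataset

# Under the global null, p-values are uniform on [0, 1]
p = np.sort(rng.uniform(size=m))

# q_i = min over j >= i of (p_j * m / rank_j)
raw = p * m / np.arange(1, m + 1)
q = np.minimum.accumulate(raw[::-1])[::-1]

# With nothing truly significant, even the smallest adjusted value
# is typically not small
print(q.min())
```

With real signal present, the smallest p-values drop far below 1/m and survive the m/rank penalty; under the null they do not.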
# Some QC plots
f, axes = plt.subplots(2, 2, figsize=(16, 8))
sns.distplot(exp_dea['p_value'], bins = 100, kde = True, ax = axes[0,0])
sns.scatterplot(data = exp_dea, x = "p_value", y = "q_value", linewidth = 0, alpha = .1, ax = axes[0,1])
sns.scatterplot(data = exp_dea, x = "q_value", y = range(len(exp_dea["q_value"])), linewidth = 0, alpha = .1, ax = axes[1,0])
axes[1,0].set_xlim(0, 0.2)
axes[1,0].set_ylim(0, 5e3)
sns.scatterplot(data = exp_dea, x = "p_value", y = range(len(exp_dea["p_value"])), linewidth = 0, alpha = .1, ax = axes[1,1])
axes[1,1].set_xlim(0, 0.1)
axes[1,1].set_ylim(0, 5e3)
sns.despine()
plt.tight_layout();
Here we can see the p-value distribution, where it is also apparent that there is a rather smooth gradient between the p- and q-values.
np.log2(1.5)
exp_dea["-log10_pval"] = -np.log10(exp_dea["p_value"])
exp_dea["abs_log2fc"] = np.abs(exp_dea['log2fc'])
exp_dea["cutoff_01"] = (exp_dea['abs_log2fc'] >= np.log2(1.5)) & (exp_dea['p_value'] < 0.01)
exp_dea["cutoff_001"] = (exp_dea['abs_log2fc'] >= 2) & (exp_dea['p_value'] < 0.001)
plt.figure(figsize=(16, 6))
sns.scatterplot(data = exp_dea, x = "log2fc", y = "-log10_pval",
linewidth = 0, alpha = .3, hue = "cutoff_01");
In order to create a network of pathways, I first need to identify which pathways my differentially expressed genes (DEGs) belong to.
The steps I need to take here can be broken down into: (1) selecting the DEGs using p-value and fold-change cutoffs, and (2) annotating the selected genes with pathway information.
p_cutoff = 0.01
fc_cutoff = np.log2(1.5)
degs_df = exp_dea.loc[(exp_dea['p_value'] < p_cutoff),]
degs_df.describe()
degs_df = degs_df.loc[(exp_dea['abs_log2fc'] >= fc_cutoff),]
degs_df.describe()
degs = degs_df["id"]
degs
I found a python module called mygene, which is a wrapper to access data from the MyGene.info services.
import mygene
mg = mygene.MyGeneInfo()
f = "name, symbol, pathway.reactome" # "all"
degs_mygene = mg.getgenes(degs, fields = f, as_dataframe = False)
# Look at structure of mygene output:
degs_mygene[:2]
degs_mygene_df = pd.DataFrame(degs_mygene)
degs_mygene_df[:5]
Some of the genes didn't seem to have any annotations in the MyGene database. These are reported in the column 'notfound', and will also have NaN in the 'pathway' column if no pathways were found to be associated with the gene.
degs_mygene_df.loc[degs_mygene_df.pathway.isnull()== True, ]
Therefore, I want to filter away these genes before proceeding with building the pathway network.
# Filter:
# - 1: Remove genes that were not found in mygene
genes_rm_1 = degs_mygene_df.loc[(degs_mygene_df.notfound == True), "query"]
print(genes_rm_1, '\n')
# Filter:
# - 2: Remove genes with no annotated pathways
genes_rm_2 = degs_mygene_df.loc[(degs_mygene_df.pathway.isnull()== True), "query"]
print(genes_rm_2, '\n')
genes_rm = list(genes_rm_1.index)
genes_rm.extend(list(genes_rm_2.index))
print("Genes without pathways:", len(genes_rm))
degs_mygene_df_filt = degs_mygene_df.drop(genes_rm)
degs_mygene_df_filt.shape
degs_mygene_df_filt = degs_mygene_df_filt.reset_index()
Out of the 399 filtered DEGs, 232 had pathways assigned to them.
Next, I'll create a list of all unique Reactome pathway IDs to use as a backbone for the network. Then, I'll populate an adjacency matrix by looking at each gene and connecting all the pathways that the gene belongs to. I'll also create a separate dictionary containing all pathway IDs as keys and the number of genes annotated to each pathway as values.
pw_list = list(degs_mygene_df_filt["pathway"])
pw_ids = list()
for i in range(len(pw_list)):
    if type(pw_list[i]) == dict:  # skip genes with no pathways assigned (should've been filtered)
        if type(pw_list[i]['reactome']) == list:  # gene has multiple pathways assigned
            ids = [d['id'] for d in pw_list[i]['reactome']]
            pw_ids.append(ids)
        elif type(pw_list[i]['reactome']) == dict:  # gene has a single pathway assigned
            ids = pw_list[i]['reactome']['id']
            pw_ids.append(ids)
def flatten(items):
    """
    Yield items from any nested iterable.
    Code found here: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
    """
    from collections.abc import Iterable  # 'collections.Iterable' was removed in Python 3.10
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            for sub_x in flatten(x):
                yield sub_x
        else:
            yield x
pw_ids_flat = list(flatten(pw_ids)) # [item for sublist in pw_ids for item in sublist]
pw_ids_flat_unique = list(set(pw_ids_flat))
print("Total number of pathway IDs:", len(pw_ids_flat))
print("Number of unique pathway IDs:", len(pw_ids_flat_unique))
# Look at how the 'pw_ids' is structured
pw_ids[:3]
Create a weighted adjacency matrix: every time two pathways share a gene, the weight of the edge between them increases by one. Thus, pathways which are commonly annotated together will show high connectivity.
I'll also create a dictionary containing, for each pathway, the number of occurrences of that pathway (i.e. the number of genes assigned to it).
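This bookkeeping can be sketched on toy data with itertools.combinations (hypothetical pathway IDs; the actual loop below additionally handles the case where a gene's single pathway is stored as a plain string):

```python
import itertools
import numpy as np

# Toy gene -> pathway annotations (hypothetical IDs)
gene_pathways = [
    ["PW-A", "PW-B"],          # gene 1
    ["PW-A", "PW-B", "PW-C"],  # gene 2
    ["PW-C"],                  # gene 3
]

pathways = sorted({pw for pws in gene_pathways for pw in pws})
index = {pw: i for i, pw in enumerate(pathways)}

adj = np.zeros((len(pathways), len(pathways)))
for pws in gene_pathways:
    # each pair of pathways sharing a gene gains one unit of edge weight
    for u, v in itertools.combinations(pws, 2):
        adj[index[u], index[v]] += 1
        adj[index[v], index[u]] += 1
```

Here PW-A and PW-B share two genes, so their edge weight ends up as 2, while PW-A and PW-C share only gene 2.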
pwd_list = pw_ids_flat_unique
n = len(pwd_list)
adj_matrix = np.zeros((n, n))
pw_dict = dict()
for pw in pwd_list:
    pw_count = 0
    pw_index = pwd_list.index(pw)
    all_pws = [d for d in pw_ids if pw in d]
    pw_dict[pw] = len(all_pws)
    for gene_pws in all_pws:
        if isinstance(gene_pws, str):
            # Handle genes with a single pathway, stored as a plain string such as "R-HSA-400253"
            gene_pw_index = pwd_list.index(gene_pws)
            if pw_index != gene_pw_index:
                adj_matrix[pw_index, gene_pw_index] += 1
                adj_matrix[gene_pw_index, pw_index] += 1
            else:
                pw_count += 1
        else:
            for gene_pw in gene_pws:
                gene_pw_index = pwd_list.index(gene_pw)
                if pw_index != gene_pw_index:
                    adj_matrix[pw_index, gene_pw_index] += 1
                    adj_matrix[gene_pw_index, pw_index] += 1
                else:
                    pw_count += 1
Look at adjacency matrix and pathway weights
print(adj_matrix.shape)
adj_matrix
# Show first 10 items in the pw_dict dictionary
import itertools
print(len(pw_dict))
dict(itertools.islice(pw_dict.items(), 10))
Plot to describe the pathway connectivity
# Look at pathway size (genes in pathway) distribution
plt.hist(pw_dict.values(), bins=100);
print("Median:", np.median(list(pw_dict.values())))
print("Average:", np.mean(list(pw_dict.values())))
Most pathways seem to have fewer than 10 genes connected to them based on the provided DEG list. However, there seem to be a few pathways with a very large number of genes in them.
# look at what pathways have the most genes in them
filt_cutoff = 25
d = {k: v for k, v in pw_dict.items() if v >= filt_cutoff}
d = sorted(d.items(), key = lambda x: x[1], reverse=True)
d
# Look at pathway-pathway connectivity (adjacency matrix)
a = adj_matrix.sum(axis=1) # [adj_matrix.sum(axis=1)!=0] # if remove pathways with 0 connections
plt.scatter(a, range(0, len(a)), s = .5)
plt.xlabel("Total number of pathway occurrences")
plt.ylabel("Index")
plt.show();
It seems most pathways don't have a lot of connections, while a few are very well connected with other pathways.
fname = "degs_mygene_df_filt.csv"
degs_mygene_df_filt.to_csv(os.path.join(dir_res, fname), index=False)
np.save(os.path.join(dir_res, "adj_matrix"), adj_matrix)
json.dump(pw_dict, open(os.path.join(dir_res, "pw_dict.txt"),'w'))
# Read objects:
degs_mygene_df_filt = pd.read_csv(os.path.join(dir_res, "degs_mygene_df_filt.csv"))
adj_matrix = np.load(os.path.join(dir_res, "adj_matrix.npy"))
pw_dict = json.load(open( os.path.join(dir_res, "pw_dict.txt") ))
Make the pathway column in degs_mygene_df_filt searchable by converting it to a string (instead of a dict), and create a function that outputs info about the searched pathway.
def identify_pathway(df = degs_mygene_df_filt, pw_search = "R-HSA-71291", print_genes = True):
    df['pathway_str'] = df['pathway'].astype(str)
    a = df[df['pathway_str'].str.contains(pw_search)]
    a = a.reset_index()
    # there's an awkward case of 'on' quoted within a string that breaks the
    # JSON parsing below, so the quotes around it are removed here
    pw_string = a['pathway_str'][0].replace("'on'", 'on')
    a_dict = json.loads(pw_string.replace("'", '"'))
    if len(pw_search) > 5:
        pw_info = next(item for item in a_dict['reactome'] if item["id"] == pw_search)
        pw_name = pw_info['name']
    else:
        pw_name = "NA"
    if print_genes:
        pw_genes_df = pd.DataFrame({'gene_id': a['query'],
                                    'gene_symbol': a['symbol'],
                                    'gene_name': a['name']})
        print(pw_search, ": ", pw_name, sep='')
        print('\n', 'Genes in pathway:', sep='')
        print(pw_genes_df)
    return pw_search, pw_name
G = nx.from_numpy_matrix(adj_matrix)
G.size()
# Create label dict
g_labels = dict(zip(list(range(0, len(pw_dict))), list(pw_dict.keys())))
First I'll look at some various metrics regarding the network:
# Network node degree, i.e. the number of edges that are connected to the node
degree_list = [d for n, d in G.degree]
plt.hist(degree_list, bins=100)
plt.show();
np.mean(degree_list)
np.median(degree_list)
On average, each pathway/node is connected to 41.6 other nodes, but most nodes fall below a degree of 50, while a few have a very high degree (>300). The median node degree is 26.
# Average shortest path
nx.average_shortest_path_length(G)
# Network diameter, i.e the maximum shortest path length among any nodes in the network:
nx.diameter(G)
It seems one can traverse the network relatively quickly, since the network diameter is no more than 5 and the average shortest path length is around 2.3.
Cliques
A clique is a subset of nodes in the graph such that every two distinct nodes in the clique are adjacent. I can identify the maximal cliques in the network using the find_cliques function.
A maximal clique is one that is not a subset of any other, larger clique.
cliques = nx.find_cliques(G)
len(list(cliques))
Then let's have an actual look at the network graph:
nx.draw(G, node_size = 10, edge_color = 'gray');
The network currently looks like a big ball of nodes, which I think makes sense for this kind of data.
To prune the network a bit, it can also be useful to look at the minimum spanning tree, where the network has been trimmed to its core: we only keep the subset of edges that connects all nodes with the minimum possible total edge weight.
tree = nx.minimum_spanning_tree(G)
nx.draw(tree, node_size = 5, edge_color = 'gray');
Refinement of the network visualization
To make the graph a bit more visually appealing, I want to draw it so that the edge widths correspond to their weights and the node sizes correspond to the number of genes sharing that pathway. I'll also throw in some colouring of the nodes according to their size.
edges = G.edges()
#colors = [G[u][v]['color'] for u,v in edges]
weights = [G[u][v]['weight']/10 for u,v in edges]
n_size = [v * 10 for v in pw_dict.values()]
plt.figure(figsize=(15,15), dpi=300)
nx.draw(G,
        edgelist = list(edges),  # 'edgelist' is the kwarg the edge drawing expects
        width = weights,
        edge_color = 'gray',
        # labels=g_labels,
        # with_labels = True, font_size=6,
        # nodelist=g_labels,
        node_size = n_size,
        node_color = n_size
        #cmap = plt.cm.YlOrRd
        )
plt.tick_params(
    axis = 'both',        # changes apply to both axes
    which = 'both',       # both major and minor ticks are affected
    bottom = False,       # ticks along the bottom edge are off
    top = False,          # ticks along the top edge are off
    labelbottom = False,  # labels along the bottom edge are off
    left = False,
    labelleft = False)
plt.show()
This looks a lot nicer!
Now ideally, I would want to be able to see which node corresponds to which pathway in an easy and interactive way. It would also be great to include some form of pathway classification to serve as colour-coding for the nodes; that would make it easier to, for instance, spot a cluster of inflammation-related pathways.
To make the graph interactive, I found a way to do it using the python module 'mpld3' (https://mpld3.github.io/) which combines matplotlib with D3.
Initially, mpld3 didn't work with my networkx object. The error said: "Object of type ndarray is not JSON serializable".
To solve this issue I did the following:
1. Installed a patched fork: python -m pip install --user "git+https://github.com/javadba/mpld3@display_fix"
2. Added import networkx to ~/.local/lib/python3.7/site-packages/mpld3/_display.py
3. Added elif isinstance(obj, networkx.classes.reportviews.NodeView): return list(obj) to the default() function in the NumpyEncoder class (also in _display.py) to make it work
import mpld3
from mpld3 import enable_notebook
enable_notebook()
I also want to find out a bit more about the most connected pathways. I can do so by picking out the top most connected pathways and using my custom identify_pathway() function to see which pathways the pathway IDs correspond to.
pw_details = list()
for pw in list(pw_dict.keys()):
    pw_details.append(identify_pathway(pw_search = pw, print_genes = False))
pw_details[:5]
Since I couldn't find any quick way to extract pathway classifications from Reactome, I came up with my own approach: identifying a few selected categories based on keywords in the pathway names. The categories I decided to look for were "signaling", "immune", "metabolism", "development", "mitochondria", "hypoxia", and "DNA". All other pathways not matched by these keywords will be classified as "other".
# Example of a simple search
sub = "DNA"
[s for s in pw_details if sub in s[1]]
# Defining pathway classes
col_list = ["#BED7D1", "#FDDFDF", "#F7EBC3", "#c1f5c5", "#c2e9f2", "#DBC4DF", "#e3b78f"]
pw_class = list()
for i in range(len(pw_details)):
    pw_id = pw_details[i][0]
    pw_name = pw_details[i][1]
    # note: '("ignaling" or "ignalling") in pw_name' would only test the first
    # string, so the two spellings are checked explicitly instead
    if "ignaling" in pw_name or "ignalling" in pw_name:
        info = [pw_id, pw_name, "Signaling", col_list[0]]
        pw_class.append(info)
    elif "mmune" in pw_name:
        info = [pw_id, pw_name, "Immune", col_list[1]]
        pw_class.append(info)
    elif "etabolism" in pw_name:
        info = [pw_id, pw_name, "Metabolism", col_list[2]]
        pw_class.append(info)
    elif "evelopment" in pw_name:
        info = [pw_id, pw_name, "Development", col_list[3]]
        pw_class.append(info)
    elif "itochondri" in pw_name:
        info = [pw_id, pw_name, "Mitochondria", col_list[4]]
        pw_class.append(info)
    elif "ypoxia" in pw_name:
        info = [pw_id, pw_name, "Hypoxia", col_list[5]]
        pw_class.append(info)
    elif "DNA" in pw_name:
        info = [pw_id, pw_name, "DNA", col_list[6]]
        pw_class.append(info)
    else:
        info = [pw_id, pw_name, "Other", "#EDF0F0"]
        pw_class.append(info)
pw_class_label = [s[2] for s in pw_class]
pw_class_label_color = [s[3] for s in pw_class]
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=col_list[0], label='Signaling', edgecolor='w'),
Patch(facecolor=col_list[1], label='Immune', edgecolor='w'),
Patch(facecolor=col_list[2], label='Metabolism', edgecolor='w'),
Patch(facecolor=col_list[3], label='Development', edgecolor='w'),
Patch(facecolor=col_list[4], label='Mitochondria', edgecolor='w'),
Patch(facecolor=col_list[5], label='Hypoxia', edgecolor='w'),
Patch(facecolor=col_list[6], label='DNA-related', edgecolor='w'),
Patch(facecolor="#EDF0F0", label='Other', edgecolor='w')]
Now we can draw the network and color by pathway class, and include node labels which tells us what pathway the node corresponds to:
random.seed(101)
edges = G.edges()
weights = [G[u][v]['weight']/20 for u,v in edges]
n_size = [(v+1) * 2 for v in pw_dict.values()]
pos = nx.spring_layout(G)
fig, ax = plt.subplots(figsize=(8,6), dpi=100)
scatter = nx.draw_networkx_nodes(G, pos, ax = ax,
#nodelist = g_labels,
node_size = n_size,
node_color = pw_class_label_color)
nx.draw_networkx_edges(G, pos, ax = ax,
                       edgelist = edges,
                       width = weights,
                       edge_color = 'gray')
ax.legend(handles = legend_elements, loc='best')
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=pw_details)
mpld3.plugins.connect(fig, tooltip)
As one can see from this graph, pathways belonging to the same class generally cluster together in the network. For instance, there is a small cluster of mitochondria/translation-related pathways bundled together. The pathways involving metabolism can also be seen to share a lot of connections. Some clusters of unclassified ("other") pathways make a lot of sense as well, such as the cluster of pathways corresponding to various cell cycle checkpoints or the one corresponding to gap junction-related pathways. It's also not too surprising that the signaling-related pathways can be seen a bit all over the network, since it is a very vague classification and these pathways could be part of many different cellular processes.
I can also do the same with the minimum spanning tree instead:
random.seed(101)
edges = tree.edges()
weights = [tree[u][v]['weight']/10 for u,v in edges]
n_size = [v + 2 for v in pw_dict.values()]
pos = nx.spring_layout(tree)
fig, ax = plt.subplots(figsize=(8,6), dpi=100)
scatter = nx.draw_networkx_nodes(tree, pos, ax = ax,
node_size = n_size,
#node_size = 5,
alpha = 0.85,
node_color = pw_class_label_color)
nx.draw_networkx_edges(tree, pos, ax=ax,
                       edgelist = edges,
                       width = weights,
                       edge_color = 'gray')
ax.legend(handles = legend_elements, loc='best')
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=pw_details)
mpld3.plugins.connect(fig, tooltip)
Identification of "top nodes"
I'd like to see what the pathways having the highest connectivity and that are of largest size corresponds to.
To do this I'll again look at the top largest pathways and then feed them to my identify_pathway() function.
# look at what pathways have the most genes in them
filt_cutoff = 20
d = {k: v for k, v in pw_dict.items() if v >= filt_cutoff}
d = sorted(d.items(), key = lambda x: x[1], reverse=True)
for p in d:
    pw = p[0]
    print(identify_pathway(pw_search = pw, print_genes = False), "Size:", p[1], "DEGs")
# look at what pathways are the most connected ones
filt_cutoff = 200
d = {k: v for k, v in dict(G.degree).items() if v >= filt_cutoff}
d = sorted(d.items(), key = lambda x: x[1], reverse=True)
for p in d:
    pw = g_labels[p[0]]
    print(identify_pathway(pw_search = pw, print_genes = False), "Degree:", p[1], "edges")
Here we can see that the largest and most connected pathways are quite generic ones, such as 'Signal Transduction', 'Immune System', and 'Metabolism'. These are probably not the most interesting pathways for this analysis though, since they are part of so many processes.
Project outcomes in brief
I was able to process an RNA-seq dataset all the way from count matrices to the generation of a pathway network based on differentially expressed genes. The obtained network can be used to study which pathways are more highly "enriched" between the two sample conditions: budesonide-treated and untreated lung cultures. Genes belonging to metabolic, signaling, immune-system-related, and hypoxia pathways are some examples of what I was able to pick up and visualize in the network. Without further details about whether the identified pathways are activated or deactivated in the treatment group, it is hard to draw any real conclusions from these results. However, with the time limitation of this course project the scope had to be limited, and I am still happy that I managed to get this far by exclusively working in Python, building from scratch a lot of things I had very limited knowledge of beforehand.
Short "disclaimer" about the pathway analysis
This is not a true pathway enrichment analysis, since ideally you'd also take the pathway size (i.e. the number of genes annotated to a pathway in the database) into consideration. Some pathways involve hundreds of genes, and seeing many genes from our DEG list within such pathways wouldn't be very surprising, whereas if a pathway only has ten associated genes and nine of them appear in our DEG list, that would be a much more significant finding. Many pathway enrichment tools report the percentage of pathway genes found in your gene list together with a statistical test, such as Fisher's exact test, with multiple-testing correction applied.
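As a sketch of what such a test looks like, using scipy's Fisher's exact test on toy numbers chosen to mirror the scales in this analysis (not actual results):

```python
from scipy import stats

# Hypothetical setup: 12918 tested genes, 232 DEGs with pathway annotations,
# and a pathway of 10 genes of which 9 overlap the DEG list
n_background = 12918
n_degs = 232
pathway_size = 10
overlap = 9

# 2x2 contingency table:
#                 in DEG list        not in DEG list
# in pathway      overlap            pathway_size - overlap
# not in pathway  n_degs - overlap   the rest of the background
table = [
    [overlap, pathway_size - overlap],
    [n_degs - overlap, n_background - pathway_size - (n_degs - overlap)],
]
odds_ratio, p_value = stats.fisher_exact(table, alternative="greater")
```

With only ~1.8% of the background in the DEG list, a 9/10 overlap is far beyond chance, so the one-sided p-value here is tiny; a real enrichment tool would run this test for every pathway and then correct for multiple testing.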
import datetime
print('Last update done: ', datetime.datetime.now(), '\n',
"Lovisa Franzén", sep='')